ABSTRACT

Nine studies of teaching critical thinking in 
Teaching Critical Thinking at the University Level:
A Review of Some Empirical Evidence*



LEONARD E. GIBBS University of Wisconsin-Eau Claire



This review was conducted specifically to help us plan a critical
thinking program for faculty at the University of Wisconsin-Eau Claire,
and to evaluate its effects on faculty and students in their classrooms.
It seemed appropriate to first weigh evidence before beginning such a
program. Thus, the questions listed in the abstract concern measures of
critical thinking, effectiveness of conventional curricula,
effectiveness of curricula designed specifically to teach critical
thinking, and factors associated with successful learning by
participants. Because we thought empirical studies would provide the
clearest answers to our questions, studies of critical thinking in
universities are the only evidence included.

Some clarifications may be helpful. 
Some otherwise excellent studies were 
excluded because they evaluated critic- 
al thinking at the pre-university level 
(Noyce, 1970; Smith & Tyler, 1942). 
The review begins with two tables. 
The first summarizes methodological 
features of studies reviewed; the 
second summarizes findings and mea- 
sures used to quantify those findings. 
Readers who want a detailed overview 
of each study and a feature-by-feature 
comparison of it with other studies, 
may find the tables and their explana- 
tions helpful. Readers who want a quick 
overview of the evidence may want to 
skip the tables and read the discussion. 
Discussion hits high points in the tables 
and follows the sequence of questions 
posed in the abstract. 

Criteria for Inclusion 

When choosing evidence, I intended to cast a net with a wide enough
aperture to catch the best empirical evidence but not so wide that it
caught a confusing mixture of weak and strong evidence. Ideally, the
best criteria for inclusion would be: random selection of subjects,
measures of proven validity and reliability, random assignment of
subjects to alternate programs for teaching critical thinking or to a
control, specific hypotheses tested by appropriate inferential
statistics, and sufficient follow-up to measure strength of effect over
time. These criteria were too rigorous. The first sweep of the net
caught nothing.

Nine studies did meet the following 
criteria: their authors studied effects 
of university level programs for teach- 
ing critical thinking; authors stated 
specifically that they were evaluating 
critical thinking; they used at least one 
measure of critical thinking to evaluate 
effects of teaching; they made some 
comparison (either pretest against 
posttest or across groups), and authors
used descriptive or inferential statis- 
tics. 

Because none had sufficient control 
over their experiment to randomly 
assign subjects to experimental condi- 
tions, designs summarized here are all 
quasi-experimental (Campbell and 
Stanley, 1963). Though there is no 
question that random assignment to 
alternate programs would enable more 
powerful causal inferences, random 
assignment is not the all inclusive facil- 
itator of high quality research (Cook 
and Campbell, 1979). Thus, studies re- 
ported here, from various contexts, 
often using different measures, can 
still provide tentative answers to our 
questions.

We located studies by asking ex- 
perts for references, by reading reviews 
on the subject (Norris, 1985; Baker, 
1979), and by searching DIALOG'S 
ERIC and other files for the intersect 
"Critical Thinking" and "Higher 
Education." The review reflects month- 
ly reviews of ERIC files through 
February, 1986. 

Explanations of Tables 

Table 1 shows how well each study 
meets several criteria for methodo- 
logical precision. It describes each 
study's merits and allows a quick com- 
parison across studies by criterion. 
The first column identifies each study 
by author and year. The second column 
contains the location of the study, 
where possible, and identifies the type 
of class or setting for subjects. Column
three describes study design according 
to Cook and Campbell's (1979) term- 
inology. The fourth and fifth columns 
give the number of subjects pretested 
and the number posttested, thus pro- 
viding a quick reference to the number 
of subjects involved in the experiment 
and any subject attrition. The symbols 
Rs and Ra in columns six and seven 
respectively, denote whether subjects 
were randomly selected for inclusion 
in the study, or were randomly as- 
signed to alternate treatments or to 
control. A slash (/) through these 
symbols means randomization criteria 
were not met. Column 8 lists the period 
of follow-up, or the interval between 
pretest and posttest, if such a design 
is used. 

Column 9 lists the Credibility Index 
(CI) for each study. This index is based
on a Quality of Study Rating Form that 
lists nine criteria for a good evaluation 
study and accompanying instructions 
for identifying and weighting those 
evaluation criteria (Gibbs, 1985). 
Thirty-nine raters have used the form 
to rate two studies, agreeing an average
of 95% and 93% with keyed criteria. 
Stronger randomized trials generally 
score above 70 points on the form. 



CI is computed by adding weights 
for the following criteria: random selec- 
tion of subjects (10 points), random 
assignment (20 points), nontreated 
control or comparison group (10 
points), number of subjects in the 
largest treatment group exceeding 
twenty (10 points), a check of validity 
by correlating the principal outcome 
measure with another similar measure 
(16 points), a reliability coefficient 
for the principal measure of critical 
thinking (15 points), a reliability co- 
efficient of at least .70 or 70% agree- 
ment between raters (9 points), follow- 
up longer than six months (4 points), 
and using an inferential statistic to 
test comparisons for statistical signi- 
ficance (6 points). The CI can range 
from zero, in a study where none of the 
criteria are met, to one hundred, where 
all criteria are met. 
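
The weighting scheme above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the weights are those listed in the text, but the criterion names, the function name, and the example study are hypothetical and do not come from the Quality of Study Rating Form itself.

```python
# Weights for the nine criteria as listed in the text; they sum to 100.
# The dictionary keys are illustrative labels, not wording from the form.
CI_WEIGHTS = {
    "random_selection": 10,
    "random_assignment": 20,
    "control_or_comparison_group": 10,
    "largest_group_over_20": 10,
    "validity_check": 16,
    "reliability_coefficient_reported": 15,
    "reliability_at_least_70": 9,
    "follow_up_over_six_months": 4,
    "inferential_statistic": 6,
}


def credibility_index(criteria_met):
    """Add the weights of the criteria a study meets (0 to 100)."""
    met = set(criteria_met)
    unknown = met - set(CI_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return sum(CI_WEIGHTS[name] for name in met)


if __name__ == "__main__":
    # Hypothetical study: comparison group, more than twenty subjects in
    # the largest group, and an inferential test, but no randomization.
    hypothetical_study = [
        "control_or_comparison_group",
        "largest_group_over_20",
        "inferential_statistic",
    ]
    print(credibility_index(hypothetical_study))
```

Because random assignment carries the largest single weight, a study that lacks it cannot score above 80 however well it meets the other criteria.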

Table 2 lists measures of critical 
thinking, criteria for evaluating mea- 
sures, and summaries of study results. 
The first column identifies each study 
by author and year. The second des- 
cribes the location and type of univer- 
sity class providing subjects. Columns 
3 through 5 list respectively, the name 
of the measure or measures used to 
quantify critical thinking, the reliability 
coefficient or percent of inter-rater 
agreement for each measure, and in- 
formation relevant to validity. Column 
6 lists principal hypotheses; these may 
be explicitly stated by the author or 
implicit. Column 7 lists the statistical 
test and "p" (significance) level re- 
lated to each hypothesis. (Here "p" 
level generally means the probability
that a given result could be found due 
to chance alone; so the smaller the "p" 
level the greater our confidence in 
difference reported.) Column 8 lists 
the strength of treatment effect (SE) 
in standard deviation units (Glass, 
1972; Hedges, 1984). This index is 
usually the mean of the experimental 
group minus the mean of the control 
group, all divided by the standard 
deviation of the control group, or the 
difference between treatments all 
divided by a pooled estimate of their 
standard deviation. Especially pertinent
comments are in column 9.
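
The two forms of the strength-of-effect index described above can be sketched as follows. This is a minimal illustration, assuming raw scores are available for each group; the function names and the scores are hypothetical, and the pooled estimate shown is the usual degrees-of-freedom-weighted one, which the text does not spell out.

```python
from math import sqrt
from statistics import mean, stdev


def effect_size_control_sd(experimental, control):
    """(experimental mean - control mean) / control standard deviation."""
    return (mean(experimental) - mean(control)) / stdev(control)


def effect_size_pooled_sd(group_a, group_b):
    """Difference between two treatment means / pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


if __name__ == "__main__":
    # Hypothetical posttest scores, not data from any study reviewed here.
    experimental_scores = [6, 8, 10]
    control_scores = [4, 6, 8]
    print(effect_size_control_sd(experimental_scores, control_scores))
    print(effect_size_pooled_sd(experimental_scores, control_scores))
```

Both functions need a standard deviation, which is why the index cannot be computed for studies that report only means.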



Findings 

Which kinds of instruments have been used most frequently by evaluators
to measure university level critical thinking? Column 3 of Table 2 shows
that the Watson-Glaser Critical Thinking Appraisal, a test whose forms A
and B were copyrighted in 1951, is most popular; three authors used it.
The eighty-item Watson-Glaser is a multiple choice test of ability to
discriminate among degrees of support for inferences, recognition of
unstated assumptions, ability to make logical deductions, interpretation
of evidence to see if generalizations or conclusions are warranted, and
ability to judge the relevance of arguments to particular questions
(Watson and Glaser, 1980). Two studies used a procedure for grading
essay tests developed by Browne, Haas and Keeley (1978). Their rubric
scores the following elements in student essays: identifying a
controversy and conclusions regarding that controversy, identifying
major arguments, identifying and analyzing implicit premises,
recognizing language difficulties (e.g. ambiguity and vagueness),
evaluating validity of individual arguments and truth of individual
premises, formulating a conclusion from premises, and recognizing
alternative inferences. Using this rubric, it takes an hour to score a
single essay. Each of the following tests was used in one study only:
The American Council of Education's Test of Critical Thinking,
Inclination toward Methodological Criticism, Ability at Methodological
Criticism, Creative Reasoning Test, Florida Taxonomy of Cognitive
Behavior, and the Cornell Critical Thinking Test.

Just as there is a wide variety of instruments used to measure critical
thinking, evaluations come from a wide range of disciplines and
locations (see column 2 of Table 2). Four authors evaluate classes
across disciplines. Others study effects of critical thinking programs
on students from a single discipline, including classes in mass
communication, business, biology, and sociology.

What are the relative merits of essay versus multiple choice tests for
critical thinking? Some argue that essay tests are more valid because
essay tests measure application of critical thinking skill, not merely
knowledge of principles (Browne, Haas, Vogt, & West, 1977). While using
the Watson-Glaser as their principal measure in a program at Bowling
Green State University, Browne and his colleagues found that students,
though able to demonstrate knowledge of critical thinking on the
Watson-Glaser, still had trouble critically evaluating essays and other
examples of thinking (Browne, Haas and Keeley, 1978). They argue that
the multiple choice Watson-Glaser may measure the ability to recognize a
valid syllogism, but may not test the ability of students to apply valid
deductive reasoning to a problem (Browne, Haas & Keeley, 1978).
Evenhandedly, Browne and his associates concede that multiple choice
tests are easy to use, have national norms, and take less time to score
than do essay tests (Keeley, Browne, & Kreutzer, 1982).

How reliable are tests of critical 
thinking? Reliability is vital to any 
evaluation, because consistent mea- 
sures help to rule out sources of varia- 
tion that can obscure real effects of 
educational programs. A rough rule of 
thumb for interpreting reliability coefficients
is that the closer they approach
one the better. Values equal to
or exceeding .70 are generally accept- 
able. 
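
To make the rule of thumb concrete, the sketch below computes two common reliability figures for a pair of raters, percent agreement and a Pearson correlation, and checks a coefficient against the .70 threshold. The function names and the ratings are hypothetical, chosen only to illustrate the calculations; they are not taken from any study reviewed here.

```python
from math import sqrt


def percent_agreement(rater_a, rater_b):
    """Percent of items on which two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length, two items or more")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


def meets_rule_of_thumb(coefficient, threshold=0.70):
    """True when a reliability coefficient reaches the .70 rule of thumb."""
    return coefficient >= threshold


if __name__ == "__main__":
    # Hypothetical essay grades from two raters on a four-point scale.
    first_rater = [1, 2, 3, 4]
    second_rater = [1, 2, 2, 4]
    print(percent_agreement(first_rater, second_rater))
    print(meets_rule_of_thumb(pearson_r(first_rater, second_rater)))
```

Percent agreement and correlation answer different questions: two raters can correlate highly while one grades consistently lower than the other, so the two figures should not be treated as interchangeable.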

Evaluators using multiple choice tests did not measure the reliability
either of the Watson-Glaser or the Cornell by using data from subjects
participating in their evaluations. However, the manual for the
Watson-Glaser (1980) reports test-retest reliability (r=.75), alternate
forms reliability (r=.75 for Form A and Form B), and split-half
reliability coefficients
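
Table 1 Credibility of Studies

Columns: 1 Author; 2 Type of Subjects; 3 Study Design; 4 No. in Pretest;
5 No. in Posttest; 6 Random Selection; 7 Random Assignment; 8 Period of
Follow-up; 9 Credibility Index. Entries marked [illegible] could not be
read in the source.

1 Author: Baker, P.J. & Anderson, L.E., 1983
2 Type of Subjects: Students in three sections of a Social Problems
  course at Illinois State University
3 Study Design: Three-group pretest-posttest
4 No. in Pretest: N1 = 20, N2 = 22, N3 = 14
5 No. in Posttest: N1 = 20, N2 = 22, N3 = 13
6 Random Selection: [illegible]
7 Random Assignment: [illegible]
8 Period of Follow-up: One semester of class
9 Credibility Index: 34

1 Author: Browne, M.N., Haas, P.F., Vogt, K.E. & West, 1977
2 Type of Subjects: Treatment group was freshmen in special course.
  Comparison group was seniors in business major.
3 Study Design: Two-group pretest-posttest
4 No. in Pretest: T = 21, Compar. = 40
5 No. in Posttest: T = 21, Compar. = 40
6 Random Selection: [illegible]
7 Random Assignment: [illegible]
8 Period of Follow-up: Academic quarter
9 Credibility Index: 26

1 Author: Givens, C.F., 1976
2 Type of Subjects: 40 randomly selected faculty and their students in
  classes at 4 universities
3 Study Design: One-group posttest only
4 No. in Pretest: None
5 No. in Posttest: 40 classes of students in 4 universities
6 Random Selection: Rs (classes selected randomly)
7 Random Assignment: [illegible]
8 Period of Follow-up: No follow-up
9 Credibility Index: 50

1 Author: Keeley, S.M., Browne, M.N., & Kreutzer, J.S., 1982
2 Type of Subjects: Students at a midwestern university
3 Study Design: Posttest only with nonequivalent groups
4 No. in Pretest: 500 freshmen, 500 seniors
5 No. in Posttest: 155 freshmen, 145 seniors
6 Random Selection: [illegible]
7 Random Assignment: [illegible]
8 Period of Follow-up: None
9 Credibility Index: 41

1 Author: Lehmann, I.J. & Dressel, P.L., 1963
2 Type of Subjects: Students at Michigan State University
3 Study Design: One-group pretest-posttest
4 No. in Pretest: Freshmen 590 M, 461 F
5 No. in Posttest: Freshmen 590 M, 461 F; Sophomore 235 M, 189 F;
  Junior 179 M, 144 F
6 Random Selection: [illegible]
7 Random Assignment: [illegible]
8 Period of Follow-up: 1, 2, 3 years
9 Credibility Index: 30

1 Author: Logan, C.H., 1976
2 Type of Subjects: Students at 8 levels in a large university, and one
  course in critical thinking (all in sociology)
3 Study Design: Two-group pretest-posttest with nonequivalent groups.
  Five groups posttested only.
4 No. in Pretest: E group = 64; comparison groups = 102, 30
5 No. in Posttest: E group = 67; comparison groups = 144, 32, 36, 42, 18
6 Random Selection: [illegible]
7 Random Assignment: [illegible]
8 Period of Follow-up: One semester in experimental group
9 Credibility Index: 26

1 Author: Meiss, G.T. & Bates, G.W., 1984
2 Type of Subjects: Students in an introductory class in mass
  communication
3 Study Design: Three-group pretest-posttest
4 No. in Pretest: N = 102
5 No. in Posttest: N1 = 27, N2 = 26, N3 = 30
6 Random Selection: [illegible]
7 Random Assignment: (treatment assigned randomly to classes)
8 Period of Follow-up: 15 weeks
9 Credibility Index: 46

1 Author: Smith, D.G., 1977; and Smith, D.G., 1983 (for more detailed
  description of the study)
2 Type of Subjects: Students in 12 classes, where teaching critical
  thinking was not a specific goal, at a small liberal arts college
3 Study Design: One-group pretest-posttest (12 classes combined)
4 No. in Pretest: N = 210
5 No. in Posttest: N = 138
6 Random Selection: [illegible]
7 Random Assignment: [illegible]
8 Period of Follow-up: One semester
9 Credibility Index: 16

1 Author: Statkiewicz, W.R. & Allen, R.D., 1983
2 Type of Subjects: One section of 112 General Biology students at West
  Virginia University
3 Study Design: One-group repeated measures design (measures made at
  three times during the semester)
4 No. in Pretest: N = 43
5 No. in Posttest: N = 48
6 Random Selection: Rs
7 Random Assignment: [illegible]
8 Period of Follow-up: One semester
9 Credibility Index: 42
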
Table 2 Measures of Critical Thinking and Findings

Columns: 1 Author; 2 Type of Subjects; 3 Measures of Critical Thinking;
4 Reliability of Critical Thinking Test; 5 Validity of Critical Thinking
Test; 6 Hyp. Tested; 7 Stat. Tested; 8 Strength of Effect (SE);
9 Comments. Columns left blank in the source are omitted from an entry.

1 Author: Baker, P.J. & Anderson, L.E., 1983
2 Type of Subjects: Students in 3 sections of a social problems course
  at Illinois State University
3 Measures: Creative Reasoning Test
4 Reliability: Inter-rater r = .70, .93, .96, .74
6 Hyp. Tested: Students will improve their critical thinking skills pre
  to posttest in three classes
7 Stat. Tested: Percent improved
8 SE: 63% had strong to moderate gains; 82% had greater posttest scores
9 Comments: Tests were scrambled so raters did not know pretest from
  posttest when scoring

1 Author: Browne, M.N., Haas, P.F., Vogt, K.E., & West, J.S., 1977
2 Type of Subjects: Treatment group was freshmen in special course.
  Comparison group was seniors in business major.
3 Measures: The principal measure was a rubric devised by the authors to
  grade essay tests, plus the Watson-Glaser and Cornell.
4 Reliability: Graders scored 122 essay tests and agreed within one
  letter grade on all but 5 of the 122 tests.
5 Validity: Authors argue that the essay test measures applied critical
  thinking.
6 Hyp. Tested: Those in a business and society cluster course will score
  higher on an essay test of CT ability than will seniors in a
  comparison group at posttest. Those in a business and society cluster
  will score significantly higher at posttest than they did at pretest.
7 Stat. Tested: F test for homogeneity of variance, t test for
  difference of means; P<.005, P<.005
8 SE: 1.48, 1.49
9 Comments: The Watson-Glaser and Cornell tests were dropped from the
  analysis because of difficulty in data collection and because 6
  students scored in the 65th percentile at pretest on WG.

1 Author: Givens, C.F., 1976
2 Type of Subjects: 40 randomly selected faculty and their students in
  classes at 4 universities
3 Measures: Florida Taxonomy of Cognitive Behavior (FTCB)
4 Reliability: 85% agreement on items for independent raters
5 Validity: Items based on Bloom's Taxonomy of Education Objectives
6 Hyp. Tested: 1) The average level of classroom discourse is on the
  lowest cognitive level (FTCB). 2) There is no difference for average
  professor's nor student's FTCB score between course level
  (basic/advanced), subject area, time the class had been in session.
  3) Students in small classes had higher cognitive level (FTCB) than in
  larger classes.
7 Stat. Tested: (See Comments); t compar. of medians; N.S., N.S., N.S.,
  P<.003
9 Comments: Both professors and students had highest mean for item 5 of
  FTCB, "Give A Specific Fact"

1 Author: Keeley, S.M., Browne, M.N., & Kreutzer, J.S., 1982
2 Type of Subjects: Students at a midwestern university
3 Measures: Rubric developed by the authors for grading essay tests
4 Reliability: Interrater reliability = .90
5 Validity: Authors think multiple choice tests fail to measure ability
  to identify argument and to generate criticism
6 Hyp. Tested: Seniors will score higher on forms P and C of an essay
  test of critical thinking than will freshmen
7 Stat. Tested: ANOVA, P<.05
8 SE: No standard deviation so the SE can't be computed
9 Comments: The authors think the statistically significant differences
  favoring seniors mask small absolute differences. They're concerned
  that 40-[illegible]% of seniors failed to provide a single example of
  a logical flaw, significant ambiguity, or misuse of data in a written
  passage.

1 Author: Lehmann, I.J. & Dressel, P.L., 1963
2 Type of Subjects: Students at Michigan State University
3 Measures: American Council of Education's Test of Critical Thinking
6 Hyp. Tested: There will be stat. significant pre-to-post differences
  at various levels in critical thinking among MSU students by academic
  year.
7 Stat. Tested: t for paired samples. During Freshman year (M + F)
  p<.001; during Sophomore year (M + F) p<.001; during Junior year
  p = N.S.; during Senior year p<.01 (M), p<.001 (F)
8 SE: Freshmen M .83, F .65; Sophomore M .27, F .21; Junior M -.01,
  F .07; Senior M .12, F .15
9 Comments: The appendix does not contain a copy of the instrument.
  Sophomores and Juniors are selected randomly. Freshmen and Seniors are
  pre-post tested. SE appears to be most pronounced in first two years.

1 Author: Smith, D.G., 1977; and Smith, D.G., 1983 (for more detailed
  description of the study)
2 Type of Subjects: Students in 12 classes, where critical thinking was
  not a specific goal, at a small liberal arts college
3 Measures: Watson-Glaser Critical Thinking Appraisal
6 Hyp. Tested: The principal hypotheses concern interactions. The three
  teaching process variables that were associated significantly with
  Watson-Glaser class means were greater student participation in class
  discussion r = .63, P<.025; teacher encouragement r = .62, P<.025;
  higher peer to peer interaction r = .57, P<.05. A pre-post comparison
  of means for Watson-Glaser test was planned, but the comparison was
  not made because the means were almost identical.
7 Stat. Tested: Canonical correlation, bivariate correlation for
  interactions
8 SE: SE = 0 (approx.)
9 Comments: Classes with low participation tended to decline in critical
  thinking, suggesting to Smith that a decline in critical thinking may
  result in classes that emphasize memorizing and a lack of practice.

1 Author: Statkiewicz, W.R. & Allen, R.D., 1983
2 Type of Subjects: One section of 112 General Biology students at West
  Virginia University
3 Measures: Practice Exercises (a forced choice, "defend your choice,"
  test developed by authors)
5 Validity: Exercise correlated (r = .56, .59, and .71) with course
  examination grade
6 Hyp. Tested: Consistent execution of practice exercises will lead to
  higher practice exercise scores in 12 member groups of randomly chosen
  A, B, C, D grade level students
7 Stat. Tested: ANOVA, P<.009
8 SE: Can't compute (no standard deviation)
9 Comments: The author's inferences that the practice exercises are a
  highly productive component of the program are weakly supported due to
  a lack of control or even a comparison group.

1 Author: Logan, C.H., 1976
2 Type of Subjects: Students at 8 levels in large university, and one
  course in critical thinking (all in sociology)
3 Measures: Inclination Toward Methodological Criticism; Ability at
  Methodological Criticism
5 Validity: Items exemplify common fallacies
6 Hyp. Tested: Students who have more sociology courses will be more
  inclined to think critically. Students who have more sociology courses
  will be better able to think critically when they are specifically
  instructed to do so. Findings: Freshmen and Sophomores in a critical
  thinking course identified an average of 2.3 and 2.4 fallacies of 10
  possible, more than in all eight levels of students including teaching
  assistants in other classes.
7 Stat. Tested: Pearson r = -.24, P<.01; Pearson r = .01, N.S.
8 SE: Can't compute SE because no standard deviation given
9 Comments: Though students in the critical thinking group only spotted
  an average of 2.3 to 2.4 among 10 possible errors in thinking, the
  range for those in introductory (mean = .29-.68) to graduate students
  (mean = 1.3-1.40) was lower than the mean in a special critical
  thinking course.

1 Author: Meiss, G.T. & Bates, G.W., 1984
2 Type of Subjects: Students in an introductory class in mass
  communication
3 Measures: Watson-Glaser Critical Thinking Appraisal
6 Hyp. Tested: There will be statistically significant pre-post
  difference scores on WGCTA among Declarative Sentence Guide, Question
  Guide, and Topic outline (control group). Only group improved with
  stat. sig. was class with Declarative Sentence Guide.
7 Stat. Tested: ANOVA, t-test; P = not reported
8 SE: SE = .63
9 Comments: A strength of this study is its random assignment of
  treatments to intact classes.

ranging from .69 to .89. Reliability for the Cornell (Ennis, Millman &
Tomko, 1985) appears to be higher for Level X (the average for fourteen
coefficients is .80) than for Level Y (the average for fourteen
coefficients is .71). For those interested in a more detailed discussion
of reliability for published tests, Ennis' (1984) critique will be
helpful. Ennis says critical thinking may be multidimensional; so tests
of reliability by measures of internal consistency may be misleading.

Those using essay tests did evaluate 
the reliability of instruments used in 
their evaluations. Browne and others 
(1977) found that, when graders 
scored 122 essay tests, scorers agreed 
within one letter grade on all but five 
essays. Their criteria for grading essays 
may have been honed to a finer edge, 
because their more recent study, 
using the same rubric for scoring, 
reported a .90 inter-rater correlation 
(Keeley, Browne, & Kreutzer, 1982). 
Baker and Anderson (1983) reported an 
average inter-rater correlation of .83 
for their Creative Reasoning Test in 
a social problems course. 

Givens (1976) reported 85% agreement between raters who applied the
Florida Taxonomy of Cognitive Behavior (FTCB) to audiotapes of classroom
behavior. The FTCB measures Bloom's taxonomy of educational objectives
including the following low to high hierarchy: knowledge of specifics,
translation, interpretation, application, analysis, synthesis, and
evaluation.

Thus, it seems that multiple choice 
and essay tests for critical thinking can 
be scored reliably. 

Next is evidence regarding the effect- 
iveness of conventional curricula on 
critical thinking. Five studies evaluate 
standard curricula. Among these, 
four examine differences between 
advanced and less advanced students 
in the same university (Givens, 1976; 
Lehmann and Dressel, 1963; Logan, 
1976; Keeley, Browne, & Kreutzer, 
1982), and one study examines pre-post
differences during one semester
(Smith, 1977).

Givens (1976) randomly selected 
forty faculty from four universities to 
represent large and small, public and 
private institutions. Givens' survey 
revealed no statistically significant 
difference between basic and advanced 
university students on the Florida 
Taxonomy of Cognitive Behavior 
(FTCB). Her most striking finding may
be that student and faculty discourse
on the FTCB averaged on the lowest
cognitive level (knowledge), but pro- 
fessors were slightly lower, on the 
average, than were their students. 
This may reflect faculty who lecture 
and students who ask questions about 
lecture content. 

Lehmann and Dressel (1963) did a 
three-year longitudinal study of stu- 
dents at Michigan State University. 
They found a statistically significant 
improvement in critical thinking on 
freshman-to-sophomore, sophomore- 
to-junior, and junior-to-senior compari- 
sons. Strength of effect is substantial 
for the freshman year but drops sharply 
thereafter. Several problems with the 
study make interpreting these findings 
difficult. Their large sample may in- 
flate significance levels. The apparent 
improvement may reflect factors other 
than education, including maturation 
as students age, and effects of life ex- 
periences outside the university. 

Logan (1976) tested students' Incli- 
nation toward Methodological Criti- 
cism (students were instructed to just 
react to a series of ten statements 
containing common fallacies in thinking 
about social issues); he also tested their 
Ability at Methodological Criticism 
(students were instructed specifically 
to think clearly and scientifically about 
each statement). He applied his 
measures to 874 sociology students at 
eight levels, from freshmen to graduate 
teaching assistants, at a large mid- 
western university. He found a nega- 
tive correlation between number of 
sociology courses taken and inclination
to think critically (r = -.24, p<.01)!
He concluded, "One plausible explana- 
tion is that what a lot of sociologists 
say and what they do are often very 
different things. The professed concern
among sociologists with teaching stu- 
dents to think more rationally and 
scientifically about social phenomena 
may be to a considerable degree lip 
service that masks a hidden curriculum. 
Sociology professors may in fact be 
more concerned with teaching students 
what to think than how to think." 

Keeley and others (1982) randomly 
selected 500 seniors and 500 freshmen 
(they got responses from 155 freshmen 
and from 145 seniors) among students 
at a midwestern university. They ad- 
ministered a reliable (r=.90) essay 
test to both groups. Seniors did statis- 
tically significantly better on the test, 
but Keeley and his associates con- 
sidered performance to be disappoint- 
ingly low for both groups. Across the 
items on the test, an average of 51% 
of freshmen and 42% of seniors got 
no points for items on the essay. 

Smith (1983) reported no difference 
over one semester for the Watson- 
Glaser. He pretested and posttested 
students in 12 classes at a small liberal 
arts college. 

The preceding results seem to show 
that conventional curricula, not de- 
signed specifically to teach critical 
thinking, may produce weak positive 
effects, no effect, or even harmful 
effects on critical thinking. However, 
these findings are hard to interpret. 
An apparent improvement may be due to normal maturation of students
during their college careers, or to student drop-out ("mortality") that
leaves more competent students to take later measures, thus giving an
illusion of an educational effect. Events other than those taking
place in the university experience 
may change thinking ability. Without 
random assignment to educational 
programs, such alternate explanations 
for findings are numerous (Campbell 
& Stanley, 1963; Cook and Campbell, 
1979). 

Which faculty and student behaviors are most associated with learning
critical thinking? Such associations are especially important because
they may suggest ways to design successful programs. Below are findings
from three studies giving information about factors associated with
students' learning critical thinking.

Smith (1977) reports statistically significant associations between high
Watson-Glaser scores and greater student participation in class
discussion (r=.63, p<.025), higher encouragement by the teacher (r=.62,
p<.025), and higher peer-to-peer interaction (r=.57, p<.05). Givens
(1976) found that scores on the Florida Taxonomy of Cognitive Behavior
were higher for students in small classes and higher in large
institutions. She also found a positive association between "analysis"
by professors and performance at that level by students (r=.18, p<.03,
N=155), but no consistent relationship between cognitive level of
professors and corresponding cognitive level of students in their
classes. Givens also found no significant difference in cognitive level
of discourse between professors and students by type of institution
(public or private), course level (beginning or advanced), subject area,
time the class had been in session, or within or between institutions.
Statkiewicz and Allen (1983) found that biology students who did
exercises designed to force them to make choices did better on the final
course grade.

Though studies of association 
suggest factors that might be harnessed
to drive a critical thinking program
toward its goals, association
evidence is weak: characteristics 
of the learning environment that seem 
to affect critical thinking may them- 
selves be only associated with real 
causal factors. For example, peer-to- 
peer interaction may be associated 
with better performance, but itself 
may be only a reflection of some parti- 
cular feature of a well-conducted dis- 
cussion. 

How effective are special courses designed specifically to coach
critical thinking? Four studies address this question. All apply
critical thinking to issues in the social sciences.

Baker and Anderson (1983) think 
most social problems courses merely
teach students to memorize and
recall: such courses do not teach students
call: such courses do not teach students 
to critically examine social issues. 
Baker and Anderson teach their stu- 
dents to scan the popular press to 
identify a problem commonly discussed 
there; define the problem; stipulate 
its various causes; and offer general
and specific solutions. They developed 
a Creative Reasoning Test to measure 
their intended goals and used it to 
evaluate the effects of three different 
teaching methods. Their Structured 
Inquiry method (where specific learn- 
ing goals are set for each student 
around some analytical thinking skill) 
produced the highest percentage of 
gain, but Focused Inquiry (where 
students select a topic, gather literature 
about it, and design a study), and Open 
Ended Inquiry (where students com- 
pared two different modes of inquiry 
including journalistic and sociological 
approaches) also produced substantial 
percentage gains. 

Meiss and Bates (1984) evaluated 
three methods for teaching critical 
thinking in an introductory mass com- 
munications class. Their methods in- 
cluded: a manual by Meiss employing 
declarative sentences to help students 
to use synthesis, application, and 
evaluation; the same manual used to 
pose thought-provoking questions, and 
a control group who merely got a topic 
outline. These methods were randomly 
assigned to three classes who attended 
the same lecture but were exposed to 
different methods in each quiz section. 
Analysis of variance comparisons were 
done at the end of the fifteenth week of 
the semester. The only statistically 
significant improvements on the 
Watson-Glaser were among students 
exposed to the declarative sentence 
method. 

I was a student in the experimental 
course that Logan (1976) evaluated. 
The instructor for the course, Professor 
Michael Hakeem, used no text. He 
read aloud parts of the day's readings 
and "thought out loud" about the 
readings for the benefit of the class. 
Students in his class were given the 
chance to read critically and react aloud 
to thinking in professional articles, 
books, and stories in the popular press. 
He criticized these student reactions 
for the benefit of the class. We were 
encouraged to think about the method 
by which each author drew conclusions, 
and not get too involved in the content 
of the material. Tests were short essays 
based on readings and ideas given to 
us. Most of us failed the first essays because 
we merely parroted back the test mate- 
rial, an effective procedure in most 
other classes. 

According to Logan (1976), those who 
took this experimental freshman and 
sophomore course were able to spot 
an average of 1.79 fallacies among a 
possible ten on a scale measuring in- 
clination to think scientifically; they 
spotted 2.35 when told specifically 
to think scientifically. This compares 
favorably with graduate teaching assistants 
in the same department, who scored 
1.11 and 1.92, respectively. 

The fourth and final evaluation of a 
course specifically designed to teach 
critical thinking is one by Browne and 
others (1977). They developed a new 
freshman-level business course. Its 
objectives were: developing critical 
thinking skills, developing respect for 
alternate viewpoints, and generating 
alternate hypotheses. They scored 
the essays of freshmen in the Business 
and Society Cluster course and con- 
currently scored essays done by a 
comparison group of senior business 
majors. Freshmen outperformed 
seniors at posttest. Pretest scores were 
almost identical for freshmen and 
seniors. Of this they say, "This [no 
difference] result was surprising 
because we had expected the seniors 
to perform significantly better than the 
cluster [freshmen] students at pretest. 
Some further examination appears to 
be necessary to determine whether the 
application of critical skills is actually 
assimilated during a traditional four- 
year curriculum." 

Conclusions 

The four studies reported immediate- 
ly above seem to indicate that critical 
thinking can be effectively taught at 
the university level. However, a caution 
is warranted. Not a single study among 
the nine reported here used random 
assignment to treatment groups or 
to treatment and control groups. Thus, 
inferences about the effects of univer- 
sity teaching on critical thinking must 
be made with caution. 

It is not surprising that critical 
thinkers would omit a major criterion 
for making causal inferences when de- 
signing their experiments. It is like 
pulling teeth to extract data from busy 
faculty, especially controls. Our recent 
experience with a randomized study of 
critical thinking at the University of 
Wisconsin-Eau Claire has made us 
much more appreciative of the studies 
reported here and respectful of prob- 
lems with randomized trials of new pro- 
grams. 

Students who have particular abil- 
ities, cognitive styles, experiences, and 
levels of motivation may benefit most 
from particular teaching approaches. 
But no aptitude-by-treatment inter- 
action (ATI) studies were found. Those 
who want to evaluate critical thinking 
programs from an ATI perspective might 
base their procedures and hypotheses 
on ATI research in science teaching 
(Koran & Koran, 1984) and discussions 
of how to design such research (Cron- 
bach & Snow, 1977). 

The critical thinking movement 
seems to be gathering momentum. Re- 
cently, journals have devoted whole 
issues to teaching critical thinking (see 
the National Forum for Winter 1985), 
educators have initiated compulsory 
tests for critical thinking statewide 
(Kneedler, 1985), and approximately 
nine hundred people attended the Third Inter- 
national Conference on Critical Think- 
ing and Educational Reform, where the 
conference director had conservatively 
expected from four to five hundred 
(Paul, 1985; R.W. Paul in a personal 
communication, June 23, 1986). 

Educators involved in the critical 
thinking movement might be able to 
direct their efforts more effectively 
if more research were available to 
guide them. Such research might be 
more useful if randomized studies 
were available to evaluate aptitude-by- 
treatment interactions, the relative 
merits of different teaching approach- 
es, and various aspects of the classroom 
environment (for example, class size), 
and if researchers met to 
isolate major dimensions of critical 
thinking and to standardize measures 
for those dimensions. 

Note 

*If you are conducting an evaluation or 
know of one, please let me know. I am 
happy to send reprints of this article. 
The author acknowledges comments 
and suggestions by Michael Hakeem, 
Pat Kark, John Morris, Diana Sigler 
and Michael Stratton. This review was 
supported by funds from the Under- 